---
sidebar_position: 4
---

# Large databases

In order to write valid queries against a database, we need to feed the model the table names, table schemas, and feature values for it to query over.
When there are many tables, columns, and/or high-cardinality columns, it becomes impossible for us to dump the full information about our database in every prompt.
Instead, we must find ways to dynamically insert into the prompt only the most relevant information. Let's take a look at some techniques for doing this.

## Setup

First, install the required packages and set your environment variables. This example will use OpenAI as the LLM.

```bash
npm install langchain @langchain/community @langchain/openai typeorm sqlite3
```

```bash
export OPENAI_API_KEY="your api key"
# Uncomment the below to use LangSmith. Not required.
# export LANGCHAIN_API_KEY="your api key"
# export LANGCHAIN_TRACING_V2=true
```

The below example will use a SQLite connection with Chinook database. Follow these [installation steps](https://database.guide/2-sample-databases-sqlite/) to create `Chinook.db` in the same directory as this notebook:

- Save [this](https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql) file as `Chinook_Sqlite.sql`
- Run sqlite3 `Chinook.db`
- Run `.read Chinook_Sqlite.sql`
- Test `SELECT * FROM Artist LIMIT 10;`

Now, `Chinhook.db` is in our directory and we can interface with it using the Typeorm-driven `SqlDatabase` class:

import CodeBlock from "@theme/CodeBlock";
import DbCheck from "@examples/use_cases/sql/db_check.ts";

<CodeBlock language="typescript">{DbCheck}</CodeBlock>

## Many tables

One of the main pieces of information we need to include in our prompt is the schemas of the relevant tables.
When we have very many tables, we can't fit all of the schemas in a single prompt.
What we can do in such cases is first extract the names of the tables related to the user input, and then include only their schemas.

One easy and reliable way to do this is using OpenAI function-calling and Zod models. LangChain comes with a built-in `createExtractionChainZod` chain that lets us do just this:

import LargeDbExample from "@examples/use_cases/sql/large_db.ts";

<CodeBlock language="typescript">{LargeDbExample}</CodeBlock>

We've seen how to dynamically include a subset of table schemas in a prompt within a chain.
Another possible approach to this problem is to let an Agent decide for itself when to look up tables by giving it a Tool to do so.
You can see an example of this in the [SQL: Agents](/docs/use_cases/sql/agents) guide.

## High-cardinality columns

High-cardinality refers to columns in a database that have a vast range of unique values.
These columns are characterized by a high level of uniqueness in their data entries, such as individual names, addresses, or product serial numbers.
High-cardinality data can pose challenges for indexing and querying, as it requires more sophisticated strategies to efficiently filter and retrieve specific entries.

In order to filter columns that contain proper nouns such as addresses, song names or artists, we first need to double-check the spelling in order to filter the data correctly.

One naive strategy it to create a vector store with all the distinct proper nouns that exist in the database.
We can then query that vector store each user input and inject the most relevant proper nouns into the prompt.

First we need the unique values for each entity we want, for which we define a function that parses the result into a list of elements:

import HighCardinalityExample from "@examples/use_cases/sql/large_db_high_cardinality.ts";

<CodeBlock language="typescript">{HighCardinalityExample}</CodeBlock>

We can see that with retrieval we're able to correct the spelling and get back a valid result.

Another possible approach to this problem is to let an Agent decide for itself when to look up proper nouns.
You can see an example of this in the [SQL: Agents](/docs/use_cases/sql/agents) guide.
